The news articles were distributed over the categories below. These categories are simply the keywords that the GDELT project used to collect the articles. Although we do not intend to use the labels assigned to each article, we drew a balanced sample of articles from every set in order to avoid biased results.
…
Our dataset is in text format, so we pre-processed it before performing any exploratory analysis. This was required to clean the data and remove unnecessary words and characters that would otherwise affect the results. Pre-processing is one of the most important steps in natural language processing: well pre-processed data speeds up subsequent computation, and the quality of the resulting tokens and analyses tends to be higher than with poorly pre-processed data.
Steps taken for pre-processing
Removed URLs from the content
Replaced punctuation, numbers, and any characters other than letters
Converted Latin-encoded characters to UTF-8
Converted the text to lower case
Removed stop words
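The steps above can be sketched in base R roughly as follows. This is a minimal illustration only: the helper name `pre_process_text` and the tiny stop-word list are hypothetical stand-ins, not the exact code used on the corpus (a real run would use a full list such as `tm::stopwords("english")`).

```r
# Illustrative sketch of the pre-processing steps (hypothetical helper).
pre_process_text <- function(text) {
  # Tiny stop-word list for illustration only.
  stop_words <- c("the", "a", "an", "and", "of", "to", "in", "is", "for")

  text <- gsub("http\\S+|www\\.\\S+", " ", text)          # remove URLs
  text <- iconv(text, from = "latin1", to = "UTF-8")      # Latin-1 -> UTF-8
  text <- gsub("[^A-Za-z ]", " ", text)                   # keep letters only
  text <- tolower(text)                                   # lower-case
  words <- strsplit(text, "\\s+")[[1]]
  words <- words[words != "" & !(words %in% stop_words)]  # drop stop words
  paste(words, collapse = " ")
}

pre_process_text("Visit https://example.com! The 3 cases rose in April.")
# "visit cases rose april"
```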
Word clouds are a representation of the underlying words in a text, in our case the news-article dataset. We are interested in the most prominent words in the corpus, so we generated word clouds for two bag-of-words weightings: term frequency (TF) and term frequency-inverse document frequency (TF-IDF).
As the word cloud below shows, the news articles have been dominated by the coronavirus pandemic. Terms with higher frequencies appear in larger fonts. Words from the bag-of-words model stand out more in the word cloud because they are weighted by raw term frequency, whereas words in the TF-IDF model are weighted on the TF-IDF scale and therefore look more uniform in size.
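The contrast between the two weightings can be sketched on a toy corpus in base R. The three-document corpus below is invented for illustration; the resulting word/freq data frames mirror the shape that `wordcloud2` expects.

```r
# Toy corpus to contrast TF and TF-IDF weighting (illustrative data).
docs   <- c("virus cases rise", "virus vaccine trial", "markets fall virus")
tokens <- strsplit(docs, " ")
terms  <- sort(unique(unlist(tokens)))

# Term frequency: total count of each term over the corpus.
tf <- sapply(terms, function(t) sum(unlist(tokens) == t))

# Inverse document frequency: log(N / number of documents containing t).
idf <- sapply(terms, function(t)
  log(length(docs) / sum(sapply(tokens, function(d) t %in% d))))

# "virus" appears in every document, so its IDF is log(3/3) = 0 and its
# TF-IDF weight vanishes; this is why TF-IDF clouds look more uniform.
tfidf <- tf * idf

df_bow   <- data.frame(word = terms, freq = tf)
df_tfidf <- data.frame(word = terms, freq = tfidf)
```

A dominant but ubiquitous term like "virus" gets the largest TF weight yet a zero TF-IDF weight, which matches the visual difference between the two clouds.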
wordcloud2(df_bow_content,shape = "star",size = 0.4)
…
wordcloud2(df_tfidf_content,shape = "star",size = 0.15)
…
To understand the most prominent terms in the article titles, we created a word cloud for the titles of the articles in the corpus.
wordcloud2(df_bow_title,shape = "star",size = 0.15)
…
The lengths of all the documents are represented as a scatter plot. Most of the documents fall in the 0-12,500 range.
doc_size <- ggplot(test_df, aes(x = ID, y = doc_length)) +
  geom_bar(stat = "identity", aes(fill = ID)) +
  theme_minimal() +
  labs(y = "Size", x = "Document")
ggsave("Document_Size_Bar.png", plot = doc_size, height = 5, width = 7)
doc_size_scatter <- ggplot(test_df) +
  aes(x = X1, y = doc_length) +
  geom_point(size = 1L, colour = "#0c4c8a") +
  labs(x = "Document", y = "Document Length") +
  theme_minimal()
…
The number of unique words per document is represented via a density plot.
density <- test_df %>%
  ggplot(aes(x = unique_words)) +
  geom_density(fill = "#009E73", color = "#F0E442", alpha = 0.8) +
  labs(y = "Document", x = "Frequency")
density +
  coord_cartesian(xlim = c(0, 20000)) +
  theme(axis.text.x = element_text(face = "bold", color = "#993333",
                                   size = 12, angle = 45))
…
Below are the 10 most frequently occurring words.
docs <- Corpus(VectorSource(test_df$pre_process_content))
UnigramTokenizer <- function(x) NGramTokenizer(x, Weka_control(min = 1, max = 1))
params <- list(minDocFreq = 1, removeNumbers = TRUE, stopwords = TRUE,
               stemming = FALSE, weighting = weightTf, tokenize = UnigramTokenizer)
dtm <- DocumentTermMatrix(docs, control = params)
dtm <- removeSparseTerms(dtm, 0.99)
rowTotals <- apply(dtm, 1, sum)  # sum of words in each document
dtm <- dtm[rowTotals > 0, ]      # drop empty documents
dtm_uni_freq <- dtm %>%
  as.matrix() %>%
  colSums() %>%
  sort(decreasing = TRUE)
dtm_uni_freq_d <- data.frame(word = names(dtm_uni_freq), freq = dtm_uni_freq)
head(dtm_uni_freq_d, 10)
…
We now look at the number of articles published from January to April.
p <- test_df2 %>%
ggplot(aes(x=Date, y=Count,group = 1)) +
geom_area(fill="#69b3a2", alpha=0.5) +
geom_line(color="#69b3a2") +
ylab("Article Count") +
theme_ipsum() +
theme(axis.text.x = element_text(face = "bold", color = "azure4",
size = 8, angle = 90),panel.grid.major = element_blank(), panel.grid.minor = element_blank())
p <- ggplotly(p)
Let us look at the overall sentiment of the published articles.
ggplot(test_df1, aes(x=ID, y=as.integer(sentiment))) +
geom_segment( aes(x=ID, xend=ID, y=0, yend=as.integer(sentiment), color=mycolor), size=1.3, alpha=0.9) +
theme_light() +
theme(
legend.position = "none",
panel.border = element_blank(),
) +
labs(y= "Sentiment", x = "Document")
…
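The per-document sentiment values plotted above could come from a lexicon-based scorer. The following is a minimal sketch of that idea with a hand-made toy lexicon; it is not the lexicon or scorer actually used to populate `test_df1`.

```r
# Minimal lexicon-based sentiment scorer (toy lexicon, illustration only).
sentiment_score <- function(text) {
  lexicon <- c(good = 1, recovery = 1, hope = 1,
               death = -1, crisis = -1, fear = -1)
  words <- strsplit(tolower(text), "\\s+")[[1]]
  sum(lexicon[words], na.rm = TRUE)  # words outside the lexicon score 0
}

sentiment_score("fear and hope amid the crisis")  # -1 + 1 - 1 = -1
```

Summing signed word weights like this yields one score per document, which is exactly the kind of value the lollipop plot above displays on the y-axis.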
To better understand the most prevalent emotions in the articles, we visualized the strength of each emotion in the corpus.
quickplot(Emotions,data=all_emot, weight=count, geom="bar",fill=Emotions, ylab="count")+ggtitle("Emotion Analysis")
…
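The per-emotion counts in `all_emot` could be produced by tallying hits against an emotion lexicon. A toy sketch follows, assuming a small hypothetical word-to-emotion lexicon; a real analysis would use a full emotion lexicon such as NRC.

```r
# Toy word-to-emotion lexicon (illustrative only).
emotion_lexicon <- list(
  fear    = c("virus", "outbreak", "death"),
  trust   = c("doctor", "vaccine"),
  sadness = c("loss", "death")
)

# Count how many words of the text fall under each emotion.
count_emotions <- function(text) {
  words  <- strsplit(tolower(text), "\\s+")[[1]]
  counts <- sapply(emotion_lexicon, function(vocab) sum(words %in% vocab))
  data.frame(Emotions = names(counts), count = as.integer(counts))
}

count_emotions("the virus outbreak caused loss and death")
```

The resulting Emotions/count data frame has the same layout that the `quickplot` call above consumes.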